10 research outputs found
HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language
This paper presents HaVQA, the first multimodal dataset for visual
question-answering (VQA) tasks in the Hausa language. The dataset was created
by manually translating 6,022 English question-answer pairs, which are
associated with 1,555 unique images from the Visual Genome dataset. As a
result, the dataset provides 12,044 gold standard English-Hausa parallel
sentences that were translated in a fashion that guarantees their semantic
match with the corresponding visual information. We conducted several baseline
experiments on the dataset, including visual question answering, visual
question elicitation, and text-only and multimodal machine translation.
Comment: Accepted at ACL 2023 (Findings) as a long paper
EFaR 2023: Efficient Face Recognition Competition
This paper presents the summary of the Efficient Face Recognition Competition
(EFaR) held at the 2023 International Joint Conference on Biometrics (IJCB
2023). The competition received 17 submissions from 6 different teams. To drive
further development of efficient face recognition models, the submitted
solutions are ranked based on a weighted score of the achieved verification
accuracies on a diverse set of benchmarks, as well as the deployability given
by the number of floating-point operations and model size. The evaluation of
submissions is extended to bias, cross-quality, and large-scale recognition
benchmarks. Overall, the paper gives an overview of the achieved performance
values of the submitted solutions as well as a diverse set of baselines. The
submitted solutions use small, efficient network architectures to reduce the
computational cost, some solutions apply model quantization. An outlook on
possible techniques that are underrepresented in current solutions is given as
well.
Comment: Accepted at IJCB 2023
CNN Patch Pooling for Detecting 3D Mask Presentation Attacks in NIR
Presentation attacks using 3D masks pose a serious threat to face recognition systems. Automatic detection of these attacks is challenging due to the hyper-realistic nature of masks. In this work, we consider presentations acquired in the near infrared (NIR) imaging channel for detection of mask-based attacks. We propose a patch pooling mechanism to learn complex textural features from lower layers of a convolutional neural network (CNN). The proposed patch pooling layer can be used in conjunction with a pretrained face recognition CNN without fine-tuning or adaptation. The pretrained CNN, in fact, can also be trained from visual spectrum data. We demonstrate the efficacy of the proposed method on mask attacks in the NIR channel from the WMCA and MLFP datasets. It achieves near-perfect results on WMCA data, and outperforms the existing benchmark on the MLFP dataset by a large margin.
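The core idea of the abstract (pooling features over local patches of a lower CNN layer to capture texture) can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, patch size, and max-pooling choice are assumptions for illustration only.

```python
import numpy as np

# Illustrative sketch of patch pooling over a CNN feature map.
# Names, shapes, and the use of max-pooling are assumptions; the
# paper's exact layer design may differ.

def patch_pool(feature_map, patch_size=4):
    """Split a C x H x W feature map into non-overlapping patches
    and max-pool each patch, yielding one value per patch per channel."""
    c, h, w = feature_map.shape
    ph, pw = h // patch_size, w // patch_size
    # Crop to a multiple of the patch size, then block-reshape.
    fm = feature_map[:, :ph * patch_size, :pw * patch_size]
    blocks = fm.reshape(c, ph, patch_size, pw, patch_size)
    return blocks.max(axis=(2, 4))  # shape: (c, ph, pw)

# Example: a 64-channel, 28x28 feature map pooled with 4x4 patches.
fm = np.random.rand(64, 28, 28).astype(np.float32)
pooled = patch_pool(fm, patch_size=4)
print(pooled.shape)  # (64, 7, 7)
```

The pooled tensor summarizes local texture statistics per region, which can then feed a classifier without retraining the backbone CNN.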
Multispectral Deep Embeddings As a Countermeasure To Custom Silicone Mask Presentation Attacks
This work focuses on detecting presentation attacks (PA) mounted using custom silicone masks. Face recognition (FR) systems have been shown to be highly vulnerable to PAs based on such masks [1, 2]. Here we explore the use of multispectral data (color imagery, near infrared (NIR) imagery, and thermal imagery) for face presentation attack detection (PAD), specifically against custom silicone mask attacks. Using a new dataset (XCSMAD) representing 21 custom-made masks, we establish the baseline performance of several commonly used face-PAD methods on the different imaging channels. Considering thermal imagery in particular, our experiments show that low-cost thermal imaging devices are as effective in face-PAD as more expensive thermal cameras for mask-based attacks. This result reinforces the case for the use of thermal data in face-PAD.
We also demonstrate that fusing information from multiple channels leads to significant improvement in face-PAD performance. Finally, we propose a new approach to face-PAD of custom silicone masks using a convolutional neural network (CNN). On individual spectral channels, the proposed approach achieves state-of-the-art results. Using multispectral fusion, the proposed CNN-based method significantly outperforms the baseline methods. The new dataset and source code for our experiments are freely available for research purposes.
Detection of Age-Induced Makeup Attacks on Face Recognition Systems Using Multi-Layer Deep Features
Makeup is a simple and easy instrument that can alter the appearance of a person's face and, hence, enable a presentation attack on face recognition (FR) systems. These attacks, especially the ones mimicking ageing, are difficult to detect due to their close resemblance to genuine (non-makeup) appearances. Makeup can also degrade the performance of recognition systems and of various algorithms that use the human face as an input. The detection of facial makeup is an effective prohibitory measure to minimize these problems. This work proposes a deep learning-based presentation attack detection (PAD) method to identify facial makeup. We propose the use of a convolutional neural network (CNN) to extract features that can distinguish between presentations with age-induced facial makeup (attacks) and those without makeup (bona fide). These feature descriptors, based on shape and texture cues, are constructed from multiple intermediate layers of a CNN. We introduce a new dataset, AIM (Age-Induced Makeups), consisting of 200+ video presentations each of old-age makeups and bona fide faces. Our experiments indicate that makeups in AIM result in a 14% decrease in the median matching scores of a recent CNN-based FR system. We demonstrate the accuracy of the proposed PAD method, which correctly classifies 93% of the presentations in the AIM dataset. In additional testing, it also outperforms existing methods for the detection of generic makeup. A simple score-level fusion, performed on the classification scores of the shape- and texture-based features, can further improve the accuracy of the proposed makeup detector.
Bengali Visual Genome 1.0
Data
-------
Bengali Visual Genome (BVG for short) 1.0 has similar goals as Hindi Visual Genome (HVG) 1.1: to support the Bengali language. BVG 1.0 is a multimodal dataset consisting of text and images, suitable for English-to-Bengali multimodal machine translation, image captioning, and multimodal research. We follow the same selection of short English segments (captions) and the associated images from Visual Genome as HVG 1.1. For BVG, we manually translated these captions from English to Bengali, taking the associated images into account. The manual translation was performed by native Bengali speakers without referring to any machine translation system.
The training set contains 29K segments. Further 1K and 1.6K segments are provided in the development and test sets, respectively, following the same (random) sampling as the original Hindi Visual Genome. A third test set, called the "challenge test set", consists of 1.4K segments. The challenge test set was created for the WAT2019 multimodal task by searching for (particularly) ambiguous English words based on embedding similarity and manually selecting those where the image helps to resolve the ambiguity. Note, however, that the surrounding words in the sentence often also provide sufficient cues to identify the correct meaning of the ambiguous word.
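One way to flag candidate ambiguous words by embedding similarity can be sketched as follows. The vectors and "sense anchor" words here are toy values for illustration; the actual WAT2019 selection procedure may have differed.

```python
import numpy as np

# Toy sketch: a word whose vector is close to anchors of two unrelated
# senses is a candidate ambiguous word. All values are illustrative.

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: "court" sits between a legal and a sports sense anchor.
emb = {
    "court":  np.array([0.7, 0.7, 0.0]),
    "judge":  np.array([1.0, 0.0, 0.0]),   # legal sense anchor
    "tennis": np.array([0.0, 1.0, 0.0]),   # sports sense anchor
}

sim_legal = cosine(emb["court"], emb["judge"])
sim_sport = cosine(emb["court"], emb["tennis"])
# Similar to both unrelated anchors -> candidate for the challenge set,
# pending manual verification that the image resolves the ambiguity.
is_ambiguous = min(sim_legal, sim_sport) > 0.5
print(is_ambiguous)  # True
```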
Dataset Formats
---------------
The multimodal dataset contains both text and images.
The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files.
All the text files have seven columns as follows:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Bengali Text
The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.
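The seven-column layout above can be read with standard TSV parsing. The sample row below is hypothetical; only the column order follows the description.

```python
import csv
from io import StringIO

# Minimal sketch of parsing one tab-delimited line of the BVG text files.
# The sample row (image_id and captions) is made up for illustration.
sample = "2407890\t140\t22\t47\t149\ta man wearing a hat\t<Bengali caption>"

reader = csv.reader(StringIO(sample), delimiter="\t")
image_id, x, y, width, height, english, bengali = next(reader)

# The (X, Y, Width, Height) box describes the captioned region; with
# Pillow it could be cropped as:
#   Image.open(f"{image_id}.jpg").crop(
#       (int(x), int(y), int(x) + int(width), int(y) + int(height)))
print(image_id, english)
```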
Data Statistics
---------------
The statistics of the current release are given below.
Parallel Corpus Statistics
--------------------------
Dataset Segments English Words Bengali Words
---------- -------- ------------- -------------
Train 28930 143115 113978
Dev 998 4922 3936
Test 1595 7853 6408
Challenge Test 1400 8186 6657
---------- -------- ------------- -------------
Total 32923 164076 130979
The word counts are approximate, prior to tokenization.
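The approximate counts above can be reproduced by simple whitespace splitting, with no tokenizer involved. A minimal sketch (the file path and malformed-line handling are assumptions):

```python
# Sketch of reproducing the approximate per-split statistics:
# segments plus whitespace-split word counts for columns 6 and 7.

def corpus_stats(path):
    segments = en_words = bn_words = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 7:
                continue  # skip malformed lines, if any
            segments += 1
            en_words += len(cols[5].split())  # English Text column
            bn_words += len(cols[6].split())  # Bengali Text column
    return segments, en_words, bn_words
```

Running this over each split's text file should yield numbers close to the table, modulo the pre-tokenization caveat noted above.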
Citation
--------
If you use this corpus, please cite the following paper:
@inproceedings{hindi-visual-genome:2022,
title= "{Bengali Visual Genome: A Multimodal Dataset for Machine Translation and Image Captioning}",
author={Sen, Arghyadeep
and Parida, Shantipriya
and Kotwal, Ketan
and Panda, Subhadarshi
and Bojar, Ond{\v{r}}ej
and Dash, Satya Ranjan},
editor={Satapathy, Suresh Chandra
and Peer, Peter
and Tang, Jinshan
and Bhateja, Vikrant
and Ghosh, Anumoy},
booktitle= {Intelligent Data Engineering and Analytics},
publisher= {Springer Nature Singapore},
address= {Singapore},
pages = {63--70},
isbn = {978-981-16-6624-7},
doi = {10.1007/978-981-16-6624-7_7},
year = {2022}
}